1 Introduction


This essay focuses on the application of a general-purpose graphics processing unit
(GPGPU) to an approximate string matching (ASM) algorithm. GPGPUs are graphics
processing units that can be used for general-purpose calculations as opposed to solely
graphics-based calculations. Because GPUs have thousands of processing cores, they are
extremely well suited to running thousands of small tasks simultaneously. This is
referred to as parallel programming, and it can lead to dramatic speed increases in
specific scenarios. This essay investigates to what extent using the GPU to process a
Levenshtein's Distance-based ASM algorithm can increase the processing speed of the
algorithm, leading to the research question: To what extent does the speed of a
Levenshtein's Distance approximate string-matching algorithm differ when it is
processed on a CPU vs. on a GPU?


2 Theory & Concepts 


2.1 Approximate String Matching (ASM) 


Approximate String Matching is the process of finding the closest, or n-closest,
matches of a given Search Term in a dictionary. Given an array of strings ‘D’, which
serves as our dictionary, and a Search Term ‘S’, then for each string in D, where the
current element index is ‘x’, we must find the number of transformations required to
morph Dx into S. There are a few valid types of transformations for Levenshtein's
Distance; these are [1]:

Insertion, adding a character into the query string: “Wrld” > “World” 
Deletion, removing a character from the query string: “Woarld” > “World” 
Substitution, replacing a character from the query string: “Wurld” > “World” 

The number of transformations between two strings is known as the Edit Distance,
and there are several algorithms available to determine this distance. Examples
include the Longest Common Subsequence, the Hamming distance, the Jaro distance,
and the Levenshtein distance. The difference between these algorithms is which
transformations they count as valid. For example, the Hamming distance only allows
substitution, thus it only applies to strings of the same length. This essay will
focus on an ASM algorithm built on the Levenshtein distance, whose valid
transformations are shown above.

The process of finding the n-closest strings to a given search string in a brute-force
manner can be summarized in the following three steps:


Step 1: For each string in the algorithm's dictionary: 
Step 2: Compute the Levenshtein Distance, store the score and the current 
dictionary string within a binary search tree. 
Step 3: Once each string has been computed, traverse the tree in order and output 
up to n strings. 
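
The three steps above can be sketched in Java as follows. This is a minimal illustration rather than the code used for the experiments: the class name NClosest and the `distance` parameter are hypothetical, and java.util.TreeMap (a balanced binary search tree keyed on the score) is swapped in for a hand-written search tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.function.ToIntBiFunction;

public class NClosest {
    // Returns up to n dictionary terms, ordered from the lowest score to the highest.
    // `distance` is any edit-distance function (an assumption of this sketch).
    public static List<String> closest(String[] dictionary, String searchTerm, int n,
                                       ToIntBiFunction<String, String> distance) {
        // Steps 1 and 2: score every term and store it in a tree keyed on the score.
        TreeMap<Integer, List<String>> byScore = new TreeMap<>();
        for (String term : dictionary) {
            int score = distance.applyAsInt(term, searchTerm);
            byScore.computeIfAbsent(score, k -> new ArrayList<>()).add(term);
        }
        // Step 3: walk the tree in ascending key order and collect up to n terms.
        List<String> result = new ArrayList<>();
        for (List<String> bucket : byScore.values()) {
            for (String term : bucket) {
                if (result.size() == n) {
                    return result;
                }
                result.add(term);
            }
        }
        return result;
    }
}
```

Terms that share a score are returned in dictionary order, which is one reasonable way to break ties.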

While ASM is a niche field, it has had a large influence not only on computer science
but also on other fields, such as biology. It plays a crucial role in several
real-world problems, for example plagiarism detection, bioinformatics, digital
forensics, spell checkers, spam filters, and search engines [2]. In certain cases,
such as search engines, the dictionary of strings to search from can become massive.
There are existing optimizations that reduce the time it takes to complete an ASM
query, such as indexing, which reduces the total number of strings we must iterate
through by grouping them under some key, such as the first couple of characters of
each string. Existing ASM optimization methods are all software-based, but unless we
find a software-based optimization with an O(1) runtime - meaning it would take the
same amount of time to run regardless of the size of the input, which as far as we
know is impossible - software-based optimizations can only speed up our query up
to a point. This is where the massively parallel nature of the GPU comes in. Taking
advantage of a GPU's immense parallel computing capabilities can theoretically
increase the speed of ASM on large databases dramatically. Combined with the
possibility for a data centre to possess dozens to hundreds of computers, each with
GPUs installed, a parallelized version of ASM could allow for very fast ASM queries
even on massive dictionaries.
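
The indexing optimization mentioned above can be illustrated with a small sketch. The class PrefixIndex and its methods are hypothetical names chosen for this example; it groups dictionary terms by their first few characters so that a query only has to score the terms in one group.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrefixIndex {
    private final Map<String, List<String>> buckets = new HashMap<>();
    private final int prefixLength;

    // Groups every dictionary term under its first `prefixLength` characters.
    public PrefixIndex(String[] dictionary, int prefixLength) {
        this.prefixLength = prefixLength;
        for (String term : dictionary) {
            buckets.computeIfAbsent(key(term), k -> new ArrayList<>()).add(term);
        }
    }

    // Terms shorter than the prefix length are keyed on the whole term.
    private String key(String term) {
        return term.substring(0, Math.min(prefixLength, term.length()));
    }

    // Only these candidates need an edit distance computed for the given search term.
    public List<String> candidates(String searchTerm) {
        return buckets.getOrDefault(key(searchTerm), Collections.emptyList());
    }
}
```

The trade-off of such a simple index is that a search term with an error inside the indexed prefix (for example “Wrold”) lands in the wrong group, so real implementations tend to use more tolerant index structures.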


2.2 Levenshtein's Distance
Levenshtein's Distance (LD) is a method of calculating the Edit Distance between two
strings that considers the previously discussed Insertion, Deletion, and Substitution
operations. Unlike some other methods, it does not incorporate Transposition
(swapping the positions of two characters).
This essay focuses on the LD algorithm as opposed to other Edit Distance algorithms
due to its ease of implementation and because it considers three Edit Operations.
The Levenshtein Distance of two strings will be calculated using the matrix shown in
Figure 2.2.1 below:
[Matrix diagram: len(S) + 1 columns and len(Dx) + 1 rows, with indices starting at 0; the bottom-right cell holds the LD]
Figure 2.2.1: Matrix used to determine Levenshtein's Distance 
Where ‘S’ is the Search Term, ‘Dx’ is the current Dictionary Term, and the function
‘len()’ returns the number of characters in the inputted term. Initially, this matrix
is empty. There are a couple of possible methods to fill it. This essay uses an
iterative method with a full matrix; I chose this method because GPUs are known to be
able to accelerate matrix-based calculations. The iterative full-matrix approach to
finding the LD involves traversing the matrix in row-major order with two for-loops,
then setting the element at the current coordinates given by the for-loops to the
result of the piecewise function shown in Figure 2.2.2 below:
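\[
M(x, y) =
\begin{cases}
\max(x, y), & \text{if } \min(x, y) = 0, \\
\min
\begin{cases}
M(x - 1, y) + 1 \\
M(x, y - 1) + 1 \\
M(x - 1, y - 1) + 1_{(D_y \neq S_x)}
\end{cases} & \text{otherwise.}
\end{cases}
\]

Figure 2.2.2: Function used to determine the LD in matrix ‘M’ at indices x, y

In this function, the indicator term equals 1 when the y-th character of the current
Dictionary Term differs from the x-th character of the Search Term, and 0 otherwise.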
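
The iterative full-matrix method described above can be sketched in Java as follows. This is an illustrative version under the standard Levenshtein definition, not the appendix code; the class and variable names are chosen for this example.

```java
public class LevenshteinSketch {
    // Fills a (len(D) + 1) x (len(S) + 1) matrix in row-major order and
    // returns the bottom-right cell, which holds the Levenshtein Distance.
    public static int distance(String dictionaryTerm, String searchTerm) {
        int rows = dictionaryTerm.length() + 1;
        int cols = searchTerm.length() + 1;
        int[][] m = new int[rows][cols];
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                if (Math.min(x, y) == 0) {
                    // First row or column: distance to or from the empty string.
                    m[y][x] = Math.max(x, y);
                } else {
                    // Characters are 0-indexed, matrix cells are 1-indexed.
                    int substitution =
                            dictionaryTerm.charAt(y - 1) == searchTerm.charAt(x - 1) ? 0 : 1;
                    int deletion = m[y - 1][x] + 1;
                    int insertion = m[y][x - 1] + 1;
                    m[y][x] = Math.min(Math.min(deletion, insertion),
                                       m[y - 1][x - 1] + substitution);
                }
            }
        }
        return m[rows - 1][cols - 1];
    }
}
```

Applied to the three examples from section 2.1, `distance("Wrld", "World")`, `distance("Woarld", "World")` and `distance("Wurld", "World")` each require a single transformation.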


2.3 Graphics Processing Unit


The Graphics Processing Unit (GPU) is a piece of hardware that is most commonly
connected to a computer via a serial expansion bus such as PCIe, a peripheral
connection interface which allows for the high-speed transfer of dozens of gigabytes
per second. This speed is required for reasonable interoperability between the CPU
and the GPU. For example, one of the most common uses of a GPU is real-time computer
graphics (hence the name). To achieve real-time speeds, potentially several gigabytes
of data stating what to draw and where to draw it on the screen must be transferred
from the computer's Main Memory to the GPU's onboard memory via a PCIe expansion bus.
This transfer of data from Main Memory to GPU Memory poses an overhead. There are 3
major sources of overhead when programming on the GPU [3]:
o CPU Wrapper Overhead: This is the overhead created by the wrappers around
GPU API (i.e., OpenCL / CUDA) functions, which are called from the CPU.
o Memory Overhead: This is the overhead created by moving data back and forth
between Main Memory and the GPU's memory.
o GPU Launch Overhead: This is the overhead created by the time it takes for the
GPU to retrieve the command given to it and begin executing it.
GPUs are structured differently from CPUs. The main difference is that whereas
modern CPUs have between two and 64 cores or so, with most consumer processors
containing four to 16 cores [4], modern GPUs have two to three thousand specialized
cores. Additionally, according to Flynn's Taxonomy, multi-core CPUs operate using
Multiple Instruction, Multiple Data (MIMD) techniques, whereas GPUs operate with
Single Instruction, Multiple Data (SIMD) techniques.
Figure 2.3.1 (Left): A Diagram showing MIMD architecture


Figure 2.3.2 (Right): A Diagram showing SIMD architecture 
This difference in processing architectures creates notable differences in how
things are processed on each device. For example, the main advantage of the GPU, or
of SIMD in general, is that it can compute mathematical operations on large sets of
data points extremely quickly relative to other architectures, with minimal memory
costs, as the instruction and data pool is loaded into the SIMD device's memory once
and is then shared with all of the processing units. The main disadvantage of SIMD
architecture is that not every algorithm can be efficiently applied to it. It also
takes a considerable amount of extra human effort to create SIMD / parallel
programs [5]. In contrast to SIMD, MIMD's main advantage is that it is comparatively
easy to program, as there is no explicit need for communication between processing
units [6], since every processing unit has its own memory.
3 Methodology & Testing


3.1 Preface 


Primary experimental data is the main source of data for this paper, for which two
logically identical programs were created. The first was implemented in Java and is
run on the CPU. The second is a kernel implemented in a variation of the C
programming language specialized for OpenCL. Both programs are provided in this
paper's appendix.

An experimental approach - where I conduct an experiment to create primary data -
was chosen to answer this paper's research question due to a lack of broad
secondary data to answer the question. While some papers provided information on
parallelized string-matching algorithms, most used a specific API or algorithm. For
example, the paper “Using GPUs to Speed-Up Levenshtein Edit Distance
Computation” used exclusively CUDA as its GPU API [7]. CUDA is only available on,
and is highly optimized for, Nvidia GPUs, possibly resulting in higher speeds than we
would expect for similarly powerful but differently branded GPUs. To avoid this
issue, this paper aims to use cross-platform software to answer the research
question at the most general scope possible, removing the performance
enhancements specific hardware manufacturers may be able to give to their own
hardware. To that end, both the Java Virtual Machine and the OpenCL API are
cross-platform.


3.2 Dependent & Control Variables 


The variable I will be using to compare the CPU and GPU's performance is the 
average time taken to complete a single ASM query in milliseconds. While there are 
other factors that can be considered, such as memory usage or power draw, time 
will give us the most quantifiable measure of how performant ASM is on both 
devices. 

The average time taken was acquired by adding the time taken to complete each 
individual query, then dividing it by 30 (Number of repeats). This value was then 
converted to milliseconds by dividing it by 1,000,000. 

The time taken to complete an individual query was acquired using 2 calls to Java's 
System.nanoTime() function, surrounding the function that completes the ASM Query 
on a given device, as shown in the pseudocode below: 


long IndividualTime = System.nanoTime();           // timestamp before the query
QueryCPULev(SearchTerm);                           // the ASM query being timed
TotalTime += System.nanoTime() - IndividualTime;   // add the elapsed nanoseconds


Figure 3.2.1: Individual ASM Query Time Measurement Pseudocode
System.nanoTime() was chosen because it draws on a high-resolution time source and
returns elapsed time in nanoseconds. This granularity is far finer than the
millisecond-scale query times being measured, so the timer itself should not limit
the precision of the results.


Control Variables: 


• The number of ASM Queries done to find the average (repeats).
This will be set at 30 for every test. Because I am calculating averages, the
more repeats, the better. However, the more repeats, the longer the execution
time. 30 struck a good balance between precision and time taken.

• The hardware used for each test. This must be controlled, as using different
hardware will evidently cause changes in performance.


3.3 Search Term Size - Hypothesis 


The first experiment investigated the effect of different Search Term Sizes on the
average time taken to complete an ASM Query.


[Diagram: the matrix from figure 2.2.1, with Search Term Size (len(S) + 1) along the horizontal axis and Current Dictionary Term Size (len(Dx) + 1) along the vertical axis]


Looking back at figure 2.2.1 above, we can see that the size of the matrix increases as 
the size of the search term increases. This matrix must be filled for each term in the 
dictionary, meaning that if the search term size increases, more work must be done 
per dictionary term. 

Both the CPU and GPU will have to complete the same amount of work per matrix.
However, the difference appears when we consider that the CPU program uses only one
processing unit to iterate through the dictionary and compute the matrix for each
term. Here, we see the GPU's massive advantage: it can assign each dictionary term
to one of its thousands of processing units.

The total work done by the CPU can be summarized as $O(S \cdot D \cdot \bar{D})$,
while the total work done by each GPU processing unit can be summarized as
$O(S \cdot \bar{D})$, where $S$ is the length of the Search Term, $D$ is the length
of the Dictionary, and $\bar{D}$ is the average length of a Dictionary Term. Looking
at both expressions, we can expect linear growth in the processing time for both the
CPU and the GPU as the Search Term Size increases; however, we can expect a much
steeper gradient from the CPU due to the extra factor $D$. While each processing unit
is doing the same amount of work to fill one matrix, because the work is divided
across so many processors, I hypothesize that the effect of increasing the search
term size will be much greater on the CPU than on the GPU.
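
As a rough illustration of these two expressions, the figures below plug in a search term of 1000 characters and a dictionary of 10,000 terms. The average term length of 15 characters is an assumption made only for this example, and constant factors are ignored.

```latex
\begin{align*}
\text{CPU (one core):}\quad
  & S \cdot D \cdot \bar{D} = 1000 \cdot 10\,000 \cdot 15 = 1.5 \times 10^{8}
    \text{ cell updates} \\
\text{GPU (per processing unit):}\quad
  & S \cdot \bar{D} = 1000 \cdot 15 = 1.5 \times 10^{4}
    \text{ cell updates}
\end{align*}
```

The ratio between the two is the factor $D$. In practice, the achievable speed-up is bounded by the number of processing units physically available and by the overheads listed in section 2.3.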


3.4 Search Term Size - Test 


The independent variable for this test was the Search Term Size, measured in 
character count. For example: “Hello” would have a size of 5. 

Testing was conducted with a dictionary of random integers up to but not restricted 
to 15 digits long. The dictionary consisted of 10,000 elements. The experiment had 
30 repeats. Search terms with the following sizes were tested: 


1, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2000, 3000, ..., 8000, 9000, 10,000.


Smaller values (from 1 to 1000) were used to investigate whether there was a
noticeable difference in performance at very small search term lengths, the thought
being that due to the GPU's initial overhead, small search term lengths might perform
better on the CPU.

Larger values (from 1000 to 10,000) were used to highlight the relationship between
the dependent and independent variables more clearly. Small values do not highlight
the relationship as well, because two search term sizes that differ only slightly
produce very small differences in processing time.


The test simply consists of:

• Generating a random string of the current Search Term Size

• Conducting an ASM Query on that string 30 times

• Measuring the Overall time taken for all 30 queries, and the Average time
taken for 1 query (by dividing the Overall time taken by 30)

• Repeating for each Search Term Size tested
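
The procedure listed above can be sketched as a small Java harness. This is an assumption-laden illustration, not the Tests.java class from the appendix: the class name SearchTermSweep, the lowercase-letter alphabet, and the `query` callback standing in for the CPU or GPU query function are all hypothetical.

```java
import java.util.Random;
import java.util.function.Consumer;

public class SearchTermSweep {
    // Builds a random lowercase string of the requested length (alphabet is assumed).
    public static String randomString(Random rng, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + rng.nextInt(26)));
        }
        return sb.toString();
    }

    // For each size: generate one search term, run the query `repeats` times,
    // and return the average time per query in milliseconds.
    public static double[] sweep(int[] sizes, int repeats, Consumer<String> query, long seed) {
        Random rng = new Random(seed);
        double[] averageMs = new double[sizes.length];
        for (int s = 0; s < sizes.length; s++) {
            String searchTerm = randomString(rng, sizes[s]);
            long totalNanos = 0;
            for (int r = 0; r < repeats; r++) {
                long start = System.nanoTime();
                query.accept(searchTerm);
                totalNanos += System.nanoTime() - start;
            }
            // Divide by the number of repeats, then convert nanoseconds to milliseconds.
            averageMs[s] = totalNanos / (double) repeats / 1_000_000.0;
        }
        return averageMs;
    }
}
```

Passing a fixed seed makes the generated search terms reproducible, so a CPU run and a GPU run can be given identical inputs.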


[Table columns: CPU Average (ms) | GPU Overall (ms) | CPU Overall (ms) | Percentage Difference of Overall Time Taken]


Figure 3.4.1: Results of the test in tabular form 


[Graph: Search Term Size vs. Average Time Taken to process ASM Query; series: GPU, CPU; x-axis: Search Term Size (1000 to 10000); y-axis: Average Time Taken (ms)]


Figure 3.4.2: Results of the test in graphical form 
3.5 Dictionary Size Test - Hypothesis


Unlike in the test in section 3.4, here we are not measuring the effect of changing the
size of the matrix in which we evaluate Levenshtein's Distance. Instead, as can be
seen in figure 3.5.1 below, we measure the effect of increasing the number of
matrices - the number of terms for which we calculate a Levenshtein's Distance.
Figure 3.5.1: 3D Visualization of the Matrices used to evaluate the LD of every term in a dictionary
The reason for conducting this test is that the GPU operates with ‘work-groups’.
These work-groups represent a portion of data that the GPU must execute the
current kernel¹ with. As the GPU does not have an infinite number of processing
units, the GPU maintains a pool of work-groups, which the processing units of the
GPU then retrieve one by one until the pool is emptied [8], at which point the GPU
has no more work to do and the task is complete.

My kernel was designed in such a fashion that each dictionary term represented a
single work-group. (Kernel Architecture is a point which will be discussed later.)


1 A kernel is a set of user-defined instructions to be executed specifically on the GPU. It can be thought of as a 
function within regular CPU programming. 


This test will therefore compare the impact of the GPU having to swap out an
increasing number of work-groups, as the number of dictionary terms outnumbers
the number of processing units available on the GPU, with the CPU, which can simply
process the dictionary sequentially with no need to retrieve data from a pool.


I hypothesized that despite the potential added overhead of having to retrieve data
from a work-group pool, the divide-and-conquer advantage that the GPU has thanks
to its thousands of cores will still allow it to vastly outperform the CPU. This
advantage is reinforced by the fact that modern GPUs possess exceedingly fast memory:
the memory of the GPU used for every test has a bandwidth of 448 GB/s [9], compared
to the RAM's average read/write speed of approximately 7 GB/s. This means that while
the GPU does have a pool of work-groups to retrieve work from, it can do so extremely
quickly.


Even if my hypothesis is correct and the GPU is still faster than the CPU, the results
of this experiment can be compared with the others to conclude which variable has
the greatest impact on the time taken to execute an ASM query. That conclusion can
be applied by developers to identify where to optimize their software to get the
fastest implementation of ASM possible. Additionally, there may be a difference in
which variable causes the greatest impact depending on the device used.


3.6 Dictionary Size Test - Test


The independent variable for this test was the Size of the Dictionary, measured in
the number of terms within it.

Testing was conducted with dictionaries of random integers up to but not restricted
to 15 digits long. The dictionaries had the following sizes:
1, 10, 50, 100, 500, 1000, 2500, 5000, 7500, 10000, 20000, 30000, ..., 100000

The experiment had 30 repeats.

As with the previous experiment, smaller values (from 1 to 10,000) were used to
investigate whether there was a noticeable difference in performance with very small
dictionary sizes, while larger values (from 10,000 to 100,000) were used to highlight
the relationship between the dependent and independent variables more clearly.


Methodology:

• Generating a random string with length 15

• Conducting an ASM Query on that string 30 times with the current Dictionary
Size

• Measuring the Overall time taken for all 30 queries, and the Average time
taken for 1 query (by dividing the Overall time taken by 30)

• Repeating for each Dictionary Size to test


[Table columns: Dictionary Size (Terms) | GPU Average (ms) | CPU Average (ms) | GPU Overall (ms) | CPU Overall (ms) | Percentage Difference of Average Time Taken | Percentage Difference of Overall Time Taken]


Figure 3.6.1: Results of the test in tabular form 
Figure 3.6.2: Results of the test in graphical form
3.7 Additional Graphs


Search Term Size vs. Percentage Difference between Average Time Taken to
complete ASM Query on the GPU and CPU.
Figure 3.7.1: Graph highlighting the percentage difference between the average time
taken per query on both devices when search term size is varied. 


Dictionary Size vs. Percentage Difference between Average Time Taken to 
complete ASM Query on the GPU and CPU. 
Figure 3.7.2: Graph highlighting the percentage difference between the average time
taken per query on both devices when dictionary size is varied. 
4 Conclusion


4.1 Effect of varying Search Term Size 
To begin, it should be stated that due to how both Search Term Size (STS) and
Average Dictionary Term Size shape the Levenshtein's Distance matrix, as seen in
figure 2.2.1, it can be concluded that both variables have very similar effects on the
performance of an ASM query.

From the experiment conducted in section 3.4, we can conclude that STS is linearly 
correlated with the average time taken to complete an ASM query. More 
importantly, it is clear that the GPU caused a massive performance increase over the 
CPU. We know this as we consistently saw the GPU completing ASM queries over 
900% faster than the CPU from 8000 to 10000 search term characters. 

However, as can be seen in figure 3.7.1, the percentage difference between the time
taken on each device appears to follow a logarithmic trend as STS increases. This
means that while the performance increase achieved by using the GPU grows
rapidly for smaller STSs, it begins to stagnate for larger STSs. In my case, the
stagnation occurred at around 5000 to 8000 search term characters, with a
percentage difference of around 950% to 1100%.

However, in real-world applications, STS values realistically never reach the larger
experimental values I used. For example, according to WolframAlpha, the average
length of an English word is 5.1 characters. Hence, for an English spellchecker that
uses ASM, you would expect the average STS to be 5.1 as well. We saw in figure 3.4.1
that with an STS of 5, we can expect a percentage difference of 154%, or a 76 ms
decrease in time taken from CPU to GPU. While a performance increase is achieved,
it is small enough in absolute terms to be almost negligible.


One final consideration is that because the Search Term and the Dictionary's Terms
are often interrelated (if the Search Term is an English word, then the Dictionary is
likely to be a Dictionary of English words, for example), we may also conclude that
both variables (STS and Average Dictionary Term Size) will vary the performance of
an ASM query similarly. Nevertheless, while there was no space to do so in this
paper, it might be worth conducting an experiment to see the performance impact of
increasing both variables, or solely Average Dictionary Term Size.


4.2 Effect of varying Dictionary Size 


As seen in the experiment conducted in section 3.6 and the graph in figure 3.6.2, the
average time taken to complete an ASM query grows exponentially with Dictionary
Size. This means that a lot more time is required per ASM query as Dictionary Size
increases.

Up to around 50-100 terms, we can see that the GPU is actually taking more average 
time to complete a query than the CPU. This is likely to be the result of GPU 
overhead. 

Looking at figure 3.7.2, the correlation is somewhat indeterminate. It would be
advisable to gather more data past 50000 terms to verify a correlation. Based on the
data gathered, it appears as though an exponential correlation is most fitting. If this
is correct, then it means that the GPU's performance advantage over the CPU keeps
growing as the Dictionary Size increases. While this is a very positive realization,
it is cancelled out by the exponential growth of the time taken per query which we
have already seen in figure 3.6.2: while an increasing amount of time is saved by
using the GPU, an increasing amount of time is also taken to process a query.


I believe a developer wanting to optimize their ASM implementation should
prioritise reducing the size of the dictionary used in ASM. This can be done with
methods such as Indexing, which was discussed in section 2.1, and Suffix Trees.
Further reading on those methods can be found in a paper by Dekel Tsur entitled
“Fast index for approximate string matching” [10].


4.3 Comparison 
Comparing the two variables used in this paper, Search Term Size (STS) and
Dictionary Size (DS), we see that on the GPU, the DS generally had a very minimal
effect on the time taken to complete a query up until about 20000 terms. Even at
20000 terms, a query took less than a second to complete, and took less time to
complete than a query with an STS of 10000. Due to the theorized exponential
growth of the percentage difference between the time taken on the CPU and GPU
when Dictionary Size is varied, there was actually quite a small difference in
performance between the CPU and GPU for the first couple of tens of thousands of
terms. However, after 20000 terms, the time required per query increased
drastically. In contrast to the DS, the STS required a lot more time per query even at
much smaller values.

In conclusion, the GPU consistently provided a performance increase over the CPU.
It is clear that a Levenshtein's Distance-based ASM algorithm does gain value and is
faster when run on the GPU. Looking at the algorithm itself, varying the Search Term
Size (and likely the Average Dictionary Term Size, as postulated at the end of section
4.1) for the most part had a lesser effect on the time taken to complete an ASM query
than changing the Dictionary Size. Therefore, it is advisable for a developer to focus
on optimizing the Dictionary Size first and foremost to improve the performance of
their ASM application.


5 Extensions 


5.1 Multithreading 


Something to consider is that the CPU code I used was not made to utilize the
multiple cores a modern CPU has. By utilizing only one core of the CPU, we leave
much of its processing power unused. I think an interesting thing to explore in an
additional research paper would be the effect of multithreading on the performance
of an ASM algorithm. This is a fairly important consideration due to the next point:
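
One possible starting point for such a multithreaded CPU version is sketched below using Java's parallel streams. This was not part of the experiments in this paper; the class name ParallelCpuAsm and the `distance` parameter are hypothetical, and any edit-distance function could be passed in.

```java
import java.util.Arrays;
import java.util.function.ToIntBiFunction;

public class ParallelCpuAsm {
    // Scores every dictionary term against the search term, letting the JVM's
    // common fork/join pool spread the terms across the available CPU cores.
    // The result keeps the dictionary's order: distances[i] belongs to dictionary[i].
    public static int[] distances(String[] dictionary, String searchTerm,
                                  ToIntBiFunction<String, String> distance) {
        return Arrays.stream(dictionary)
                .parallel()
                .mapToInt(term -> distance.applyAsInt(term, searchTerm))
                .toArray();
    }
}
```

Because each dictionary term is scored independently, this mirrors the one-term-per-processing-unit division of work used on the GPU, only with far fewer workers.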


5.2 Server CPUs 


Server CPUs often have many more cores available to them than home or desktop
CPUs. The importance of this comes when we look at the previous point. If the CPU
code takes advantage of the dozens or even hundreds of cores that a Server CPU may
have, could it achieve better performance than a GPU, at least with a small enough
Dictionary Size or other variable? This is important as ASM algorithms may be
implemented in the cloud. To save costs on GPUs for the cloud servers, an owner
may prefer to only use the CPU, at which point having an efficient, multithreaded
CPU ASM algorithm would be highly beneficial.


5.3 Kernel Architecture 


My final consideration is that coding on the GPU is a nuanced process. Contrary to
CPU programming, there is a lot of freedom in how things are processed and how
memory is handled. As such, there are more ways of optimizing algorithms to
maximise the use of the thousands of cores present on the GPU. I believe exploring
different manners of processing ASM on the GPU, and/or exploring how best to
arrange the memory passed to the GPU, may be an interesting avenue for future
research.
6.2 General Code


Here you will find most Java classes used in gathering data for this paper.


Main.java 


import Compute.ComputeProgram;
import Compute.GPU;


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Scanner;


tap (A lake bleke GR NE 


import static org.lwjgl.opencl.CL10.*;
Main ( 


Scanner 
Sean 


jua O oa REECH ao MSE) | 
FunctionStartTimer = System.nanoTime() 
ArrayList<String> WordList = ArrayList<>() 
de Eet 
Main. .getClassLoader () .getResourceAsStream(List) 
(BufferedReader BR = BufferedReader ( 
InputStreamReader (IS, StandardCharsets. 
String Line 
((Line = BR.readLine()) != 
WordList.add (Line) 


(IOException e) { 
e.printStackTrace() 


} 
String[] WordListArray = String[WordList.size()] 


(Sezilas Nore 2 Momus) I 
(Word.length() > 


= Word.length() 


FunctionTimeTakenl = (System.nanoTime() - FunctionStartTimer) 


System. PEIEE 
FunctionTimeTakenl, FunctionTimeTakenl / 1000000) 


= GPU. eelløoerasd( 
geit | Selen 


* WordList.size() 


= GPU. see een ( 
WordList.size()) 
Gering $ 8 Morse) 4 


(int j = 07 3 < 
(9 < Solenciia()) 4 
paes ehar AcE 


„put (0) 


EE Eleng (0) 


= GPU. eelleerae (il) 
put (0, WordList.size()) 


= GPU. eel ee ihm (dl) 
„put (0 ) 
FunctionTimeTaken2 = (System.nanoTime() - FunctionStartTimer) 


System. TP E 
FunctionTimeTaken2, FunctionTimeTaken2 / 


WordList.toArray (WordListArray) 


QueryCPULev (String SearchTerm) 
FunctionStartTimer = System. nanoTime() 
SearchTr LevenshteinTr = SearchTree ( 
LevenshteinData ( SE) 


SearchTerm + SearchTerm.toLowerCase () 


(String Base : DE 
Levenshtein.Calculate(Base, SearchTerm, LevenshteinTree) 


FunctionTimeTakenl = System.nanoTime() - FunctionStartTimer 
( ) System. SJØ ene: ( 
FunctionTimeTakenl, FunctionTimeTakenl / 100000 


( WO 
LevenshteinTree.PrintInorder (LevenshteinTr 


QueryGPULev (String SearchTerm) ( 


FunctionStartTimer = System.nanoTime() 


SearchTerm = SearchTerm.toLowerCase() 


IntBuffer SearchTermBuffer = 
.callocInt ( (SearchTerm) .length() ) 
( CurrentChar : SearchTerm.toCharArray()) { 
SearchTermBuffer.put (CurrentChar) 


) 


SearchTermBuffer.position(0) 


IntBuffer SearchTermLengthBuffer = GPU. „eel leserne) 
SearchTermLengthBuffer.put(0, SearchTerm.length()) 


FloatBuffer OutBuffer = GPU. Jealloerloat( 


ComputeProgram SolverProgram = 
„get ( ) 


Flags = | 
lverProgram. teWriteIntBuffer (0, SearchTermBuffer, Flags) 
lverProgram. teWritelntBuffer (1, SearchTermLengthBuffer 


lverProgram. teWriteIntBuffer (2 Flags) 
lverProgram. teWritelntBuffer (3 Flags) 
lverProgram. teWriteIntBuffer (4 Flags) 


lverProgram. teIntBuffer (5 Flags) 
lverProgram. teFloatBuffer(6, OutBuffer, Flags) 


lverProgra 
olverProgram. 


FunctionTimeTakenl = System.nanoTime() - FunctionStartTimer 
( ) System. Former 
FunctionTimeTakenl, FunctionTimeTakenl / 1000000) 


lverProgram.AutoSetKernelArgs () 
lverProgram.AutoEnqueuelD() 
lverProgram.ReadFloatBuffer(6, OutBuffer) 
FunctionTimeTaken2 = System.nanoTime() - FunctionStartTimer 
( ) Systen. JOE MEE ( 
FunctionTimeTaken2 - FunctionTimeTakenl (FunctionTimeTaken2 - 
FunctionTimeTakenl) / 


SearchTr LevenshteinTr = SearchTree ( 
LevenshteinData ( TESE) 


TermIndex = 
(OutBuffer.hasRemaining()) { 
LevenshteinTr „Insert ( SearchTree ( 
LevenshteinData ( [TermIndex++] 


OutBuffer.get()) 
)) 


FunctionTimeTaken3 = System.nanoTime() - FunctionStartTimer 


) Svsren JOE ( 
FunctionTimeTaken3 - FunctionTimeTaken2 (FunctionTimeTaken3 - 
FunctionTimeTaken2) / 1000000) 
) System. TOERNEE 
FunctionTimeTaken3, FunctionTimeTaken3 / 1000000) 


IA 


ishteinTr .PrintInorder (LevenshteinTr 


mesa (Setas 1] args I 
GPU.Init() 


GPU. AddProgram( 


.getClassLoader() .getResourceAsStream ( 


= LoadWordList ( 


Scanner (System. 


( Sé 
System. 5 Ore sialic ar ( 
String SearchTerm = .nextLine 


( ) I 
( ) System. Peme 
QueryGPULev(SearchTerm) 
) System. ica JL 


) System. ae la (( 
QueryCPULev(SearchTerm) 
System. EE Le la ()) 


( 
System. reacia (( ) 
Tests.DictionarySizeGPU() 
System. State lia ()) 
System. SES ) 
Tests.DictionarySizeCPU() 


} 
GPU.Dispose() 


End of Main.java 


Tests.java 


import java.util.Random;


Tests ( 
String GenRandString ( Length) { 
Random R = Random () 
StringBuilder Result = StringBuilder () 
( a = 07 < sacada) stars) Á 
Result.append ( ( DA ET se IN mesa loci (0) + 


Result.toString() 


SearchTermSize() { 


Main.QueryCPULev(GenRandString(10)) 


REssats = 
[] SearchTermSizes = 


me 100 DOD Oi 


00} 
System. omana 
System. 5 Ove at rare Lia ()) 
for (int SearchTermSize : SearchTermSizes) {
String[] TestSearchTerms = new String[Repeats];
for (int i = 0; i < Repeats; i++) {
TestSearchTerms[i] = GenRandString(SearchTermSize);


long SuperTotalTime = System.nanoTime();

long TotalTime = 0;

for (String TestSearchTerm : TestSearchTerms) {
long IndividualTime = System.nanoTime();


Main.QueryCPULev(TestSearchTerm);
TotalTime += System.nanoTime() - IndividualTime;


System. Pr LIANE IE 
SearchTermSize) 

System. PEINE 
System.nanoTime() - SuperTotalTime (System.nanoTime() - SuperTotalTime) / 


00000) 


System. Monster 
(TotalTime / Repeats) (TotalTime / Repeats) 


DictionarySizeGPU() { 
Repeats


SuperTotalTime = System.nanoTime() 
TotalTime 


Main, Ouer 
TotalTime + 


PEINE 
tem.nanoTime() - Supe 
000) 
em.out.printf ( 
(TotalTime / 


SuperTotalTime 
TotalTime = 0 


em.out.printf ( 
Repeats) (morce kime / R 
} 


End of Tests.java 


epeats 


R++) { 
= GenRandString(15) 


- ITime 


otalTime S em.nanoTime() - SuperTotalTime) / 


(TotalTime 


System.nanoTime() 


R++) { 
GenRandString(15) 


.nanoTime() 


- ITime 


‚stem.nanoTime() - SuperTotalTime) / 


(TotalTime 
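
The timing pattern in Tests.java (generate random search terms, run what appears to be one untimed warm-up query, then average System.nanoTime() differences over a number of repeats) can be sketched in isolation as follows. This is a minimal sketch, not the listing itself: the class name BenchmarkSketch, the helpers randomLowercase and meanNanos, and the Consumer<String> stand-in for Main.QueryCPULev or Main.QueryGPULev are assumptions introduced for illustration, as is the choice of lowercase a-z as the alphabet.

```java
import java.util.Random;
import java.util.function.Consumer;

final class BenchmarkSketch {
    /** Builds a random string of the given length from 'a'..'z' (assumed alphabet). */
    static String randomLowercase(Random rng, int length) {
        StringBuilder result = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            result.append((char) ('a' + rng.nextInt(26)));
        }
        return result.toString();
    }

    /** Runs the query once per term and returns the mean time per call in nanoseconds. */
    static long meanNanos(Consumer<String> query, String[] terms) {
        long total = 0;
        for (String term : terms) {
            long start = System.nanoTime();
            query.accept(term);
            total += System.nanoTime() - start;
        }
        return total / terms.length;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Hypothetical stand-in for the real query function being measured.
        Consumer<String> query = term -> term.hashCode();

        query.accept(randomLowercase(rng, 10)); // untimed warm-up call

        String[] terms = new String[100];
        for (int i = 0; i < terms.length; i++) {
            terms[i] = randomLowercase(rng, 15);
        }
        long mean = meanNanos(query, terms);
        System.out.printf("Mean: %dns (%.3fms)%n", mean, mean / 1_000_000.0);
    }
}
```

Timing each call separately and summing the differences, as the listing does, keeps the cost of generating the random terms out of the measured total.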
SearchTree.java


SearchTree {


SearchTree 
LevenshteinData 
SearchTree 


SearchTree(LevenshteinData DataIn)
= DataIn


Insert (SearchTree Node) { 
19 Direction 


1.equals ( 


Wou 
Node 


. Insert (Node) 


(Direction.equals ( 


. Insert (Node) 


IllegalArgumentException()


PrintInorder(SearchTree Node 


Estate [Oil >= kime |p al ||) 


PrintInorder(Node.

Kama JO < iee 

(!Node. £ .equals ( 
Node. TOWE 


PrintInorder (Node. 


End of SearchTree.java 


Limit) 
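
SearchTree.java orders the scored dictionary entries so that the closest matches can be printed first. A minimal sketch of such a tree follows, assuming each node holds a term and a float score, that lower scores sort to the left, and that an in-order walk stops after a caller-supplied limit. The names ScoreTreeSketch, insert and inOrder are hypothetical, and the original's Direction-based branching and IllegalArgumentException case are not reproduced.

```java
import java.util.ArrayList;
import java.util.List;

final class ScoreTreeSketch {
    final String term;
    final float score;
    ScoreTreeSketch left;
    ScoreTreeSketch right;

    ScoreTreeSketch(String term, float score) {
        this.term = term;
        this.score = score;
    }

    /** Inserts a node: lower scores go left, equal or higher scores go right. */
    void insert(ScoreTreeSketch node) {
        if (node.score < score) {
            if (left == null) left = node; else left.insert(node);
        } else {
            if (right == null) right = node; else right.insert(node);
        }
    }

    /** Appends terms in ascending score order until `limit` terms are collected. */
    void inOrder(List<String> out, int limit) {
        if (out.size() >= limit) return;
        if (left != null) left.inOrder(out, limit);
        if (out.size() >= limit) return;
        out.add(term);
        if (right != null) right.inOrder(out, limit);
    }

    public static void main(String[] args) {
        // Scores here are illustrative values, not measured results.
        ScoreTreeSketch root = new ScoreTreeSketch("Wurld", 1.0f);
        root.insert(new ScoreTreeSketch("World", 0.0f));
        root.insert(new ScoreTreeSketch("Wxyz", 3.0f));
        root.insert(new ScoreTreeSketch("Word", 2.0f));

        List<String> best = new ArrayList<>();
        root.inOrder(best, 2); // keep only the two lowest-scoring terms
        System.out.println(best);
    }
}
```

Because both methods recurse, a dictionary inserted in already-sorted score order would produce a very deep tree; a balanced tree or a bounded priority queue would avoid that, at the cost of more code.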
6.3 CPU Specific Code


Here you will find most classes that were exclusively required by the CPU. 


Levenshtein.java 


java.lang.Math.max 
java.lang.Math.min 


Levenshtein { 


Calculate (String BaseTerm, String SearchTerm 
SearchTree Tree) { 


String OperatingOriginalString = + BaseTerm


[] [] Matrix = 
[ OperatingOriginalString.length() ][ SearchTerm.length() ] 


y « Matrix. wer) dl 
x = Op = < Meee 07 . x+ 


(min(x, y) == 
Matrix[y] [x] 


{ 
Term1 = Matrix[y -
Term2 = Matrix 
Term3 = Matrix[y - 
(OperatingOriginalString.charAt (y) != 
SearchTerm.charAt (x)) { 
Term3++ 


min(Term1, min(Term2, Term3))


Distance = Matrix [OperatingOriginalString.length() - 
1] [SearchTerm.length() - 1] 

TotalLen = (OperatingOriginalString.length() - 1) +
(SearchTerm.length() - 1) 


Ratio ) (TotalLen - Distance) / ( ) TotalLen 
Score )Distance + (Distance == 0 ? 0 : Ratio) 


Tree.Insert ( SearchTree ( 
LevenshteinData (BaseTerm, Score) 


)) 


CalculateTreeless(String Original, String Search) 
String OperatingOriginalString = + Original


[][] Matrix =
[ OperatingOriginalString.length() ][ Search.length() ] 


y SS EE end visse) A 


{ 


End of Levenshtein.java 


erm2, Term3)) 


derer Set lencia) = 
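
The recurrence that Levenshtein.java evaluates can be written compactly as below. This is a minimal sketch rather than the listing's exact code: it sizes the matrix one larger than each string instead of prefixing the strings with a padding character (which the listing appears to do), it returns the raw integer distance without the Ratio and Score adjustment or the tree insertion, and the class and method names are assumptions.

```java
final class LevenshteinSketch {
    /** Edit distance between base and search using the full dynamic-programming matrix. */
    static int distance(String base, String search) {
        int rows = base.length() + 1;   // row y covers the first y characters of base
        int cols = search.length() + 1; // column x covers the first x characters of search
        int[][] matrix = new int[rows][cols];

        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                if (Math.min(x, y) == 0) {
                    // Border cells: turning an empty prefix into a prefix of length max(x, y).
                    matrix[y][x] = Math.max(x, y);
                } else {
                    int deletion = matrix[y - 1][x] + 1;
                    int insertion = matrix[y][x - 1] + 1;
                    int substitution = matrix[y - 1][x - 1];
                    if (base.charAt(y - 1) != search.charAt(x - 1)) {
                        substitution++;
                    }
                    matrix[y][x] = Math.min(deletion, Math.min(insertion, substitution));
                }
            }
        }
        return matrix[rows - 1][cols - 1];
    }

    public static void main(String[] args) {
        // The three transformation examples from section 2.1.
        System.out.println(distance("Wrld", "World"));
        System.out.println(distance("Woarld", "World"));
        System.out.println(distance("Wurld", "World"));
    }
}
```

Each cell depends only on the cell above, the cell to the left, and the diagonal cell, so two rows of storage would suffice; the full matrix is kept here to mirror the listing.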
LevenshteinData.java


Ba (SHE due RE 


End of LevenshteinData.java 
6.4 GPU-Specific Code


Here you will find the Java classes that connect to the GPU and run GPU code.


ComputeProgram.java 


Compute 


java.io.*

java.nio.Buffer
java.nio.ByteBuffer
java.nio.IntBuffer
java.nio.charset.StandardCharsets
java.nio.file.Files
java.nio.file.Paths


Ai EE E is 


org.lwjgl.PointerBuffer 


java.nio.FloatBuffer
java.util.HashMap 


Compute.GPU.* 
Compute.InfoUtil.checkCLError 
Eras, ent eme CED + 
org.lwjgl.system.MemoryUtil.* 


org.apache.commons.io.* 


ComputeProgram {


String LoadSource(String Path) throws Exception {
return Files.readString(Paths.get(Path));


ComputeProgram ( Context, String SourcePath) ( 
String Source = 
( 
Source = LoadSource(SourcePath) 
(IOException E) ( 
System. EE SE + SourcePath + 


) 


String SourceFilename = FilenameUtils.getBaseName(SourcePath) 


Program = clCreateProgramWithSource(Context, Source 


BuildProgram(Program) 


= clCreateKernel(Program, SourceFilename 
checkCLError ( ) 
clReleaseProgram(Program) 


ComputeProgram ( Context, InputStream IS, String ProgramName) 


StringBuilder Source = StringBuilder () 


(BufferedReader BR = BufferedReader ( 
InputStreamReader(IS, StandardCharsets. 
String Line 
((Line = BR.readLine()) != 
Source.append (Line) 


(IOException e) ( 
System. [Ssi a (| 
ProgramName + ) 


} 


Program = clCreateProgramWithSource(Context 
Source.toString() ) 


BuildProgram (Program) 


= clCreateKernel (Program, ProgramName 
checkCLError ( ) 
clReleaseProgram(Program) 


BuildProgram ( Program) ( 
(clBuildProgram(Program, GPU.
Wou 


ByteBuffer BuildLog = 
PointerBuffer BuildLogSize = PointerBuffer.allocateDirect (1) 


( clGetProgramBuildInfo(Program 
BuildLog, BuildLogSize) == 

System. .println(BuildLogSize.get(0))

byte[] BuildLogBA = new byte[(int)BuildLogSize.get(0)]
BuildLog.get(BuildLogBA)
{ 
System. ice a (( String (BuildLogBA (Bi 
(UnsupportedEncodingException e) ( 

e.printStackTrace() 


SCE 


HashMap<Integer, Long>
HashMap<Integer, Long>()


CreateFloatBuffer( ArgumentIndex, FloatBuffer Capacity 
Flags) ( 
MemoryObject = clCreateBuffer( Flags, Capacity
) 
checkCLError ( ) 
. put (ArgumentIndex, MemoryObject) 


WriteFloatBuffer ( ArgumentIndex, FloatBuffer Data) { 
checkCLError(clEnqueueWriteBuffer( 


.get (ArgumentIndex) 


CreateIntBuffer (


ArgumentIndex 


)) 


IntBuffer Capacity 


MemoryObject 
) 
checkCLError ( 


JO 


) 


t (Argument Index 


WriteIntBuffer ( 


clCreateBuffer( 


ArgumentIndex 


Flags, Capacity 
MemoryObject) 


IntBuffer Data) { 


checkCLError(clEnqueueWriteBuffer( 


.get (ArgumentIndex) 


CreateWriteFloatBuffer( 
Jet teen Flags) ( 
CreateFloatBuffer(ArgumentIndex 


Data )) 


ArgumentIndex, FloatBuffer 


InitialData, Flags) 


WriteFloatBu 


Crea 
InitialData 
CreatelntBuf 


Flags) 


ffer (ArgumentIndex 


teWritelntBuffer ( 
{ 


fer (ArgumentIndex 


InitialData) 


ArgumentIndex 


InitialData 


IntBuffer 


Flags) 


WriteIntBuffer(ArgumentIndex, InitialData) 


AutoSetKernelArgs() { 


. forEach ( (ArgumentIndex > 


( 


MemObject) 


MemObject) 


clSetKernelArg1p( ArgumentIndex


}) 


AutoEnqueue1D() {
PointerBuffer GlobalWorksizeBuffer 
GlobalWorksizeBuffer.put (0 


llocPointer (1) 


GPU 
) 


PointerBuffer LocalWorksizeBuffer locPointer (1) 


LocalWorksizeBuffer.put (0 


Bu = GPU. 


Pointer ffer KernelEvent Scale Ponne AN 


clEnqueueNDRangeKernel( 


GlobalWorksizeBuffer 
NES « 
LocalWorksizeBuffer 


Kernel! 


clWaitForEvents (KernelEvent) 


ReadIntBuffer ( OutputArgumentIndex, IntBuffer Buffer)


clEnqueu 


gument In 


gumentI 


t (OutputArgumentIndex 


End of ComputeProgram.java 


GPU.java 


Compute 


org.lwjgl.PointerBuffer 
org.lwjgl.opencl.CL 
org.lwjgl.opencl.CLCapabilities 
org.lwjgl.opencl.CLContextCallback 
org.lwjgl.system.MemoryStack 


java.io.InputStream 
java.nio.IntBuffer 
java.util.HashMap 


Compute.InfoUtil.checkCLError 
org. LWI ee Ee Ek 
org.lwjgl.opencl.CL11. 
org.lwjgl.system.MemoryUtil.* 


GEUR 


MemoryStack 
Nte 


LCapabilities 
LCapabilities 


LContextCallback 


HashMap<String, ComputeProgram> 
HashMap<String, ComputeProgram>()


AddProgram (String Name, String Path) { 
ComputeProgram Program = ComputeProgram ( 


Benn 
.put (Name, Program) 
AddProgram(String Name, InputStream IS) 
COENEN EE EH S ComputeProgram ( 


IES) 
Name 


. put (Name, Program) 


GetPlatformAndDevice (MemoryStack Stack) 
IntBuffer PlatformCount = Stack.mallocInt (1) 


( 


PointerBuffer AvailablePlatforms = Stack.mallocPointer (1) 


checkCLError ( 


clGetPlatformIDs (AvailablePlatforms, PlatformCount) 


(PlatformCount.get (0) == 0) { 
RuntimeException ( 


= AvailablePlatforms.get (0) 
= CL.createPlatformCapabilities( 


IntBuffer DeviceCount = Stack.mallocInt (1) 
PointerBuffer AvailableDevices = Stack.mallocPointer (1) 
checkCLError ( 
clGetDeviceIDs(
AvailableDevices, DeviceCount) 


) 


(DeviceCount.get (0) == 0) { 
RuntimeException ( 


= AvailableDevices.get(0) 
= CL.createDeviceCapabilities( 


) 


PointerBuffer GetContextProperties(MemoryStack Stack) 
PointerBuffer ContextProperties = Stack.mallocPointer (3) 


ContextProperties 
¿One (10; 
jaime (il 
“jure. (O) 


ContextProperties 


GetContextCallback() { 
= CLContextCallback.create((errinfo, private_info,
cb, user_data) -> {
System. iia la (( ) 
System. SES Lak (( + memUTF8(errinfo)) 


}) 


Pest baie) 4 
StringBuilder SB = StringBuilder() 


nd 


de 


+ InfoUtil.getDeviceInfoStringUTF8( 


nd : ) 
Type = InfoUtil.getDeviceInfoLong( 
(( )Type) { 


SB.append ( 


SB.append ( 


SB.append ( 


SB.append ( 
InfoUtil.getDeviceInfoStringUTF8 ( 


SB.append ( 
InfoUtil.getDeviceInfoStringUTF8( 
) 


SB.append (
InfoUtil.getDeviceInfoInt(
)


SB.append (
InfoUtil.getDeviceInfoInt(
+ )


SB.append (
InfoUtil.getDeviceInfoLong(


SB.append (
InfoUtil.getDeviceInfoInt(
)


SB.append( .getSize() + 


System. State la (SIB) 


Taie) 1 
= MemoryStack.create (500 


schel Love iia, (au) 
GetPlatformAndDevice ( 


GetContextCallback () 


= clCreateContextFromType(GetContextProperties(
0 ) 
checkCLError ( ) 


= clCreateCommandQueue(


) 
checkCLError ( 


PrintInfo() 


Dispose() {
. forEach( (Name, Program) -> { 
Program. .forEach ( (Key, MemObject) 
clReleaseMemObject (MemObject) 


}) 


Program. .Clear() 


End of GPU.java 
InfoUtil.java


OPSREL EE > 
.system.MemoryStack.* 
.system.MemoryUtil.* 


aro Ue N 


TESE 4 


PrintDeviceInfo( 

System. PEINE) 

System. EE 
InfoUtil.getDeviceInfoStringUTF8 (Device 


DeviceType = InfoUtil.getDeviceInfoLong(Device
) 

ng DeviceTypeString = 

DeviceType == 

DeviceTypeString 


(DeviceType = 
DeviceTypeString 


{ 
DeviceTypeSt 


DEA 
cial 
Emiel 
eviceInfoInt 
caba dL 
eviceInfoInt 
sjort 
{Emiel 
tDeviceInfoLon 
tem. = Ove ae] 
tDeviceInfoLo 
tem. poem 


+ DeviceTypeString) 


+ 
Device )) 


Device 


DIG DB ce (Bl == ER 


Device 


) 


Q 


=) 


public static String getPlatformInfoStringASCII(long cl_platform_id,
        int param_name) {
    try (MemoryStack stack = stackPush()) {
        PointerBuffer pp = stack.mallocPointer(1);
        // First call queries the size of the value in bytes.
        checkCLError(clGetPlatformInfo(cl_platform_id, param_name,
                (ByteBuffer)null, pp));
        int bytes = (int)pp.get(0);

        ByteBuffer buffer = stack.malloc(bytes);
        // Second call reads the value itself.
        checkCLError(clGetPlatformInfo(cl_platform_id, param_name,
                buffer, null));

        return memASCII(buffer, bytes - 1);
    }
}


public static String getPlatformInfoStringUTF8(long cl_platform_id,
        int param_name) {
    try (MemoryStack stack = stackPush()) {
        PointerBuffer pp = stack.mallocPointer(1);
        checkCLError(clGetPlatformInfo(cl_platform_id, param_name,
                (ByteBuffer)null, pp));
        int bytes = (int)pp.get(0);

        ByteBuffer buffer = stack.malloc(bytes);
        checkCLError(clGetPlatformInfo(cl_platform_id, param_name,
                buffer, null));

        return memUTF8(buffer, bytes - 1);
    }
}


public static int getDeviceInfoInt(long cl_device_id, int param_name) {
    try (MemoryStack stack = stackPush()) {
        IntBuffer pl = stack.mallocInt(1);
        checkCLError(clGetDeviceInfo(cl_device_id, param_name, pl, null));
        return pl.get(0);
    }
}


public static long getDeviceInfoLong(long cl_device_id, int param_name) {
    try (MemoryStack stack = stackPush()) {
        LongBuffer pl = stack.mallocLong(1);
        checkCLError(clGetDeviceInfo(cl_device_id, param_name, pl, null));
        return pl.get(0);
    }
}


public static long getDeviceInfoPointer(long cl_device_id, int param_name) {
    try (MemoryStack stack = stackPush()) {
        PointerBuffer pp = stack.mallocPointer(1);
        checkCLError(clGetDeviceInfo(cl_device_id, param_name, pp, null));
        return pp.get(0);
    }
}


public static String getDeviceInfoStringUTF8(long cl_device_id,
        int param_name) {
    try (MemoryStack stack = stackPush()) {
        PointerBuffer pp = stack.mallocPointer(1);
        checkCLError(clGetDeviceInfo(cl_device_id, param_name,
                (ByteBuffer)null, pp));
        int bytes = (int)pp.get(0);

        ByteBuffer buffer = stack.malloc(bytes);
        checkCLError(clGetDeviceInfo(cl_device_id, param_name, buffer, null));

        return memUTF8(buffer, bytes - 1);
    }
}


getProgramBuildInfolnt(long cl i 
name) { 


public static String getProgramBuildInfoStringASCII(long cl_program_id,
long cl_device_id, int param_name)


return mem 


be cheekEtErron (aa 
l= CI UCCE ) 


ew RuntimeExcep 


End of InfoUtil.java 


6.5 GPU Kernel 


Here you will find the GPU Kernel which solves ASM queries. 


LevenshteinSolver.cl 


ame ESD (ame 
return 


| void LevenshteinSolver ( 
global const int *Term,
lobal cons *SearchTe 
lobal cons *Base, 
lobal cons *BaseSize 
lobal cons 


in 


Im 
in 


Gr (er ier (a 


in 


ER le EE eooramatresı/ 


rmLen, 


lé 


eier lar des (On 


tert Coup 
lese seler “Goute 


E Cloos uel(0) 5 


“LongestBaseTermLen, 
DistanceMatrices, 


CurrentBaseTermLenFou 


t i = ooh << 
tanceMatrices 


rrentBaseTermLen = 0; 


L = LongestBaseTermLen[0]+1; 


r 


nd false; 


JE Oa < 7 a 


anceMatrices[F3D(0 


lé 


one (une 3x ; x < Search 
iim as sesjela sr A 
int CurrentSearchChar 
mor (tbm y > ily Sv < TÉ 
assigned to this work item*/ 


TermLen [0] ; x++) { /*For each character 


= Term[x 
In ya each character in the base 


int CurrentBaseCha 


r = Basel ( (G LEL) = W) = ELI; 


if (CurrentBaseCha 
break; 


e == 0) | 


CurrentBaseTermLen 
(CurrentBaseChar”*0)); 


if (min(x,y) == 0) 
DistanceMatric 


) 
else ( 
int EqTerml 


int EqTerm2 


int EqTerm3 


EqTerm3 += 1 * 


x (!(CurrentBaseTermLenFound”0) 


{ 
SS SID (Oe, I, vg SE, 


DistanceMatrices[F3D(x 


DistanceMatrices[F3D(x, 


DistanceMatrices[F3D(x 


(bool)(CurrentSearchChar ^ CurrentBaseChar);


min(EqTerm2, EqTerm3


End of LevenshteinSolver.cl 


min(EqTerm1,
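
LevenshteinSolver.cl appears to assign one dictionary entry to each work item and to address its working memory through the F3D index macro. Since most of the kernel body did not survive extraction, the following is only a CPU-side Java emulation of a kernel of that general shape, under stated assumptions: the dictionary is packed into one zero-padded int array with a fixed stride of the longest word length plus one, a zero marks the end of a word, and f3d maps (x, y, work item) into a flat array of per-item matrices. The class name and all method names and signatures are hypothetical and are not taken from the original kernel.

```java
final class FlatLayoutSketch {
    /** Packs words into one zero-padded int array; each word occupies `stride` slots. */
    static int[] flatten(String[] words, int stride) {
        int[] flat = new int[words.length * stride];
        for (int w = 0; w < words.length; w++) {
            for (int c = 0; c < words[w].length(); c++) {
                flat[w * stride + c] = words[w].charAt(c);
            }
        }
        return flat;
    }

    /** Maps (x, y, z) to an index in a flat array of z-many width*height matrices. */
    static int f3d(int x, int y, int z, int width, int height) {
        return z * width * height + y * width + x;
    }

    /** Emulates one work item: distance between `term` and the word stored in slot `id`. */
    static int workItem(int id, int[] term, int[] base, int stride, int[] matrices) {
        int width = term.length + 1;
        int height = stride; // rows 0..longest word length
        int baseLen = 0;
        for (int x = 0; x < width; x++) {
            matrices[f3d(x, 0, id, width, height)] = x;
        }
        for (int y = 1; y < height; y++) {
            int baseChar = base[id * stride + y - 1];
            if (baseChar == 0) break; // zero padding marks the end of this word
            baseLen = y;
            matrices[f3d(0, y, id, width, height)] = y;
            for (int x = 1; x < width; x++) {
                int up = matrices[f3d(x, y - 1, id, width, height)] + 1;
                int left = matrices[f3d(x - 1, y, id, width, height)] + 1;
                int diag = matrices[f3d(x - 1, y - 1, id, width, height)]
                        + (term[x - 1] != baseChar ? 1 : 0);
                matrices[f3d(x, y, id, width, height)] = Math.min(up, Math.min(left, diag));
            }
        }
        return matrices[f3d(term.length, baseLen, id, width, height)];
    }

    /** Runs every "work item" in a plain loop; a GPU would run them in parallel. */
    static int[] solveAll(String term, String[] words) {
        int longest = 0;
        for (String w : words) longest = Math.max(longest, w.length());
        int stride = longest + 1;
        int[] base = flatten(words, stride);
        int[] t = term.chars().toArray();
        int[] matrices = new int[words.length * (t.length + 1) * stride];
        int[] out = new int[words.length];
        for (int id = 0; id < words.length; id++) {
            out[id] = workItem(id, t, base, stride, matrices);
        }
        return out;
    }
}
```

The fixed stride wastes memory on short words, but it lets every work item find its own data with simple arithmetic on its index, which is the usual reason for choosing such a layout on a GPU.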